
    Who Spoke What? A Latent Variable Framework for the Joint Decoding of Multiple Speakers and their Keywords

    In this paper, we present a latent variable (LV) framework to identify all the speakers and their keywords given a multi-speaker mixture signal. We introduce two separate LVs to denote the active speakers and the keywords uttered. The dependency of a spoken keyword on the speaker is modeled through a conditional probability mass function. The distribution of the mixture signal is expressed in terms of the LV mass functions and speaker-specific-keyword models. The framework admits, as speaker-specific-keyword models, any stochastic model representing the probability density function of the observation vectors given that a particular speaker uttered a specific keyword. The LV mass functions are estimated in a maximum-likelihood framework using the expectation-maximization (EM) algorithm. The active speakers and their keywords are detected as modes of the joint distribution of the two LVs. On mixture signals containing two speakers uttering keywords simultaneously, the proposed framework achieves an accuracy of 82% in detecting both the speakers and their respective keywords, using Student's-t mixture models as the speaker-specific-keyword models. Comment: 6 pages, 2 figures. Submitted to IEEE Signal Processing Letters.
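
    As a rough illustration of the estimation step described above, the sketch below runs EM over the joint mass function w(s, k) of the two latent variables, assuming per-frame log-likelihoods under pre-trained speaker-specific-keyword models are already available; the array shapes and function name are illustrative, not the authors' code.

```python
import numpy as np

def em_speaker_keyword(loglik, n_iter=50):
    """EM estimate of the joint latent-variable mass function w[s, k].

    loglik: array (T, S, K) of log p(x_t | speaker s, keyword k)
            under pre-trained speaker-specific-keyword models.
    Returns w: (S, K) joint probability mass over (speaker, keyword).
    """
    T, S, K = loglik.shape
    w = np.full((S, K), 1.0 / (S * K))          # uniform initialization
    for _ in range(n_iter):
        # E-step: responsibilities gamma[t, s, k] ∝ w[s, k] * p(x_t | s, k)
        log_post = loglik + np.log(w)[None, :, :]
        log_post -= log_post.max(axis=(1, 2), keepdims=True)  # numerical stability
        gamma = np.exp(log_post)
        gamma /= gamma.sum(axis=(1, 2), keepdims=True)
        # M-step: mixture weights are the average responsibilities
        w = gamma.mean(axis=0)
    return w

# Active speaker-keyword pairs are read off as modes of w, e.g. for a
# two-speaker mixture:
# np.unravel_index(np.argsort(w, axis=None)[-2:], w.shape)
```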

    Time-varying sinusoidal demodulation for non-stationary modeling of speech

    Speech signals contain a rich time-evolving spectral content, and accurate analysis of this time-evolving spectrum is an open challenge in signal processing. Towards this, we revisit time-varying sinusoidal modeling of speech and propose an alternative model estimation approach. The estimation operates on the whole signal, without any short-time analysis. The approach proceeds by extracting the fundamental frequency sinusoid (FFS) from the speech signal. The instantaneous amplitude (IA) of the FFS is used for voiced/unvoiced stream segregation. The voiced stream is then demodulated using a variant of in-phase and quadrature-phase demodulation carried out at harmonics of the FFS. The result is a non-parametric time-varying sinusoidal representation: an additive mixture of quasi-harmonic sinusoids for the voiced stream and a wideband mono-component sinusoid for the unvoiced stream. The representation is evaluated for analysis-synthesis, and the bandwidths of the IA and instantaneous frequency (IF) signals are found to be crucial in preserving quality. The obtained IA and IF signals are also found to carry perceived speech attributes, such as speaker characteristics and intelligibility. Compared with existing approaches that operate on short-time segments, the proposed framework improves on simplicity of implementation, objective scores, and computation time. Listening test scores suggest that the synthesis preserves naturalness but does not yet beat the state-of-the-art short-time analysis methods. In summary, the proposed representation lends itself to high-resolution temporal analysis of non-stationary speech signals, and also allows quality-preserving modification and synthesis.
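
    A minimal sketch of the in-phase/quadrature demodulation idea, assuming a fundamental-frequency track f0 has already been extracted; the filter order and bandwidth are placeholder choices, and the paper's FFS extraction and voiced/unvoiced segregation steps are omitted.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def demodulate_harmonic(x, f0, fs, k=1, bw_hz=40.0):
    """IA/IF of the k-th harmonic via in-phase/quadrature demodulation.

    x  : voiced speech signal
    f0 : instantaneous fundamental frequency track, same length as x (Hz)
    fs : sampling rate (Hz)
    Returns (ia, if_hz): instantaneous amplitude and frequency of harmonic k.
    """
    phase = 2 * np.pi * k * np.cumsum(f0) / fs        # carrier phase at harmonic k
    baseband = x * np.exp(-1j * phase)                # shift harmonic k to DC
    b, a = butter(4, bw_hz / (fs / 2))                # low-pass keeps one quasi-harmonic
    z = filtfilt(b, a, baseband.real) + 1j * filtfilt(b, a, baseband.imag)
    ia = 2 * np.abs(z)                                # factor 2: real-signal demodulation
    if_hz = k * f0 + np.gradient(np.unwrap(np.angle(z))) * fs / (2 * np.pi)
    return ia, if_hz
```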

    Perception of time-varying signals: timbre and phonetic JND of diphthong

    In this paper we propose a linear time-varying model for diphthong synthesis based on linear interpolation of formant frequencies. We then determine the timbre just-noticeable difference (JND) for the diphthong /aɪ/ (as in ‘buy’) with a constant-pitch excitation, through a perception experiment involving four listeners, and explore the phonetic JND of the diphthong. The JND responses are determined using a 1-up-3-down procedure. Using the experimental data, we map the timbre JND and phonetic JND onto a 2-D region of percentage change of the formant glides. The timbre and phonetic JND contours for constant pitch show that the phonetic JND region encloses the timbre JND region and varies across listeners. In some listeners, the JND is observed to be more sensitive to the ending vowel /ɪ/ than to the starting vowel /a/, and to depend on the direction of perturbation of the starting and ending vowels.
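
    The 1-up-3-down rule mentioned above is a standard adaptive staircase that converges near the 79.4%-correct point; below is a generic sketch (starting level, step size, and stopping rule are illustrative, not the experiment's actual settings).

```python
import random

def staircase_1up_3down(respond, start=8.0, step=1.0, n_reversals=8):
    """Track a JND with a 1-up-3-down adaptive rule.

    respond(level) -> True if the listener detects the difference.
    The perturbation decreases after 3 consecutive correct responses and
    increases after any incorrect one; the threshold estimate is the mean
    level at the final reversals.
    """
    level, correct_run, direction = start, 0, -1
    reversals = []
    while len(reversals) < n_reversals:
        if respond(level):
            correct_run += 1
            if correct_run == 3:                 # 3 correct in a row -> harder
                correct_run = 0
                if direction == +1:              # direction change = reversal
                    reversals.append(level)
                direction = -1
                level = max(level - step, 0.0)
        else:                                    # 1 wrong -> easier
            correct_run = 0
            if direction == -1:
                reversals.append(level)
            direction = +1
            level += step
    tail = reversals[-6:]
    return sum(tail) / len(tail)

# Example with a simulated listener whose true threshold is a 3% formant change:
# jnd = staircase_1up_3down(lambda lv: lv > 3.0 or random.random() < 0.5)
```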

    Linear Prediction Based Diffuse Signal Estimation for Blind Microphone Geometry Calibration

    The spatial cross-coherence function between two locations in a diffuse sound field is a function of the distance between them. Earlier approaches to microphone geometry calibration utilizing this property assume the presence of an ambient noise source. Instead, we consider geometry estimation using a single acoustic source (not noise) and show that late reverberation (diffuse signal) estimation using multi-channel linear prediction (MCLP) provides a computationally efficient solution. The idea is that the component of a reverberant signal corresponding to late reflections satisfies the diffuse sound field properties, which we exploit for distance estimation between microphone pairs. MCLP of short-time Fourier transform (STFT) coefficients is used to decompose each microphone signal into early and late reflection components. The cross coherence computed between the separated late reflection components is then used for pair-wise microphone distance estimation, and multidimensional scaling (MDS) is used to estimate the microphone geometry from the pair-wise distances. We show that higher reverberation, though detrimental to signal estimation, can aid microphone geometry estimation. A position error of less than 2 cm is achieved using the proposed approach on real microphone recordings.
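
    Two of the building blocks above can be sketched compactly: fitting the diffuse-field coherence model to estimate a pair-wise distance, and classical MDS to recover geometry from the distance matrix. This assumes coherence values between late-reflection components have already been computed; the grid range is illustrative.

```python
import numpy as np

def distance_from_coherence(coh, freqs, c=343.0,
                            d_grid=np.arange(0.01, 1.0, 0.001)):
    """Estimate a mic-pair distance by fitting the diffuse-field model
    coherence(f, d) = sin(2*pi*f*d/c) / (2*pi*f*d/c) to measured values.

    Note np.sinc(x) = sin(pi*x)/(pi*x), hence the argument 2*f*d/c.
    """
    model = np.sinc(2 * np.outer(d_grid, freqs) / c)   # (len(d_grid), len(freqs))
    err = ((model - coh[None, :]) ** 2).sum(axis=1)
    return d_grid[np.argmin(err)]

def classical_mds(D, dim=2):
    """Recover microphone coordinates (up to rotation/translation)
    from a pairwise distance matrix D via classical MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n                # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                        # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]                 # top eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```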

    Event-triggered sampling using signal extrema for instantaneous amplitude and instantaneous frequency estimation

    Event-triggered sampling (ETS) is a new approach towards efficient signal analysis. The goal of ETS need not be limited to signal reconstruction; the desired information in the signal can be estimated directly through skillful design of the event. We show the promise of the ETS approach for the analysis of oscillatory non-stationary signals modeled by a time-varying sinusoid, compared with existing uniform Nyquist-rate sampling based signal processing. We examine samples drawn using ETS, with zero crossings (ZC), level crossings (LC), and extrema as events, under additive in-band noise and jitter in the detection instants. We find that extrema samples are robust and also facilitate instantaneous amplitude (IA) and instantaneous frequency (IF) estimation for a time-varying sinusoid. The estimation uses extrema samples alone, with a local polynomial regression based least-squares fitting approach. For noisy signals, the proposed approach improves over the widely used analytic signal, energy separation, and ZC based approaches, which rely on uniform Nyquist-rate data acquisition and processing. Further, extrema-based ETS in general gives a sub-sampled representation (relative to the Nyquist rate) of a time-varying sinusoid. For the same data-set size, extrema-based ETS gives much better IA and IF estimation than uniform sampling.
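
    A simplified sketch of extrema-based IA/IF estimation: extrema of a time-varying sinusoid sample its amplitude envelope, and consecutive extrema are roughly half a period apart. The single global polynomial fit below stands in for the paper's local polynomial regression, so it is an approximation of the idea rather than the authors' estimator.

```python
import numpy as np

def ia_if_from_extrema(x, fs, deg=2):
    """IA and IF of a time-varying sinusoid from its extrema samples only.

    Returns polynomial coefficients for IA(t) and IF(t); evaluate with
    np.polyval(coeffs, t).
    """
    d = np.diff(x)
    ext = np.where(d[:-1] * d[1:] < 0)[0] + 1          # indices of local extrema
    t_ext = ext / fs
    # IA: least-squares polynomial through |x| at the extrema
    ia_poly = np.polyfit(t_ext, np.abs(x[ext]), deg)
    # IF: half a period elapses between consecutive extrema
    t_mid = 0.5 * (t_ext[1:] + t_ext[:-1])
    if_inst = 0.5 / np.diff(t_ext)
    if_poly = np.polyfit(t_mid, if_inst, deg)
    return ia_poly, if_poly
```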

    Reverberation Robust TDOA Estimation using Convex Region Prior

    The Time Difference of Arrival (TDOA) of an acoustic source with respect to a given pair of microphones is typically estimated as the maximizer of a Measure of Synchrony (MoS) between the pair of microphone signals, within the interval [-D/v, D/v], where D is the inter-microphone distance and v is the speed of sound. In practical enclosures, phantom sources created by reverberation can cause false peaks in the MoS, leading to erroneous TDOA estimates. In several practical enclosures such as meeting rooms, conference halls, and lecture halls, the regions which can admit an acoustic source are restricted by furniture and other fixtures. Consequently, it is possible to acquire prior knowledge of the region that can accommodate an acoustic source. In this paper, we propose an approach that utilizes this prior knowledge of the source region for accurate TDOA estimation. We transform the prior knowledge into an interval of possible TDOAs using a newly developed concept of a family of hyperboloids. This interval, referred to as the Region Constrained TDOA Interval (RCTI), is shown to be smaller than [-D/v, D/v]. The TDOA is then estimated as the maximizer of a suitable MoS within the RCTI. We demonstrate that estimating the TDOA within the RCTI, with the generalized cross correlation (GCC) class of functions as the MoS, is more robust to reverberation than estimating it within [-D/v, D/v].
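
    The RCTI idea can be illustrated with GCC-PHAT as the MoS: compute the cross-correlation as usual, but search for the peak only within the prior interval. The sketch below assumes the RCTI endpoints (tau_min, tau_max) have already been derived from the source-region prior.

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs, tau_min, tau_max):
    """TDOA via GCC-PHAT, with the peak search restricted to a prior
    interval [tau_min, tau_max] (e.g., the RCTI instead of the full
    [-D/v, D/v])."""
    n = 2 * max(len(x1), len(x2))
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    S = X1 * np.conj(X2)
    S /= np.abs(S) + 1e-12                              # PHAT weighting
    cc = np.fft.irfft(S, n)
    cc = np.concatenate((cc[-n // 2:], cc[:n // 2]))    # center zero lag
    lags = np.arange(-n // 2, n // 2) / fs              # lags in seconds
    mask = (lags >= tau_min) & (lags <= tau_max)        # prior-constrained search
    return lags[mask][np.argmax(cc[mask])]
```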

    Blocking artifacts in speech/audio: Dynamic auditory model-based characterization and optimal time-frequency smoothing

    We revisit the problem of blocking artifacts and their suppression in generic frame-based speech/audio applications. We provide a perceptual characterization of the artifacts using dynamic auditory models. We propose short-time-Fourier-transform-based magnitude and phase smoothing techniques and show that localized time-frequency smoothing suppresses the artifacts to a large extent. Our experiments show that magnitude smoothing is superior to phase smoothing, and that the latter is in fact detrimental to signal quality. We provide examples on natural speech and audio signals in the context of compression.
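
    A minimal sketch of magnitude smoothing along the STFT time axis while keeping the original phase; unlike the paper's localized smoothing near block boundaries, this version smooths uniformly across all frames, and the window and kernel lengths are placeholders.

```python
import numpy as np
from scipy.signal import stft, istft

def smooth_blocking_artifacts(x, fs, nperseg=512, smooth_frames=3):
    """Suppress frame-boundary (blocking) artifacts by smoothing the
    STFT magnitude along time while keeping the original phase."""
    f, t, Z = stft(x, fs, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)
    kernel = np.ones(smooth_frames) / smooth_frames     # moving average over frames
    mag_s = np.apply_along_axis(
        lambda m: np.convolve(m, kernel, mode="same"), 1, mag)
    _, y = istft(mag_s * np.exp(1j * phase), fs, nperseg=nperseg)
    return y
```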

    Feature Selection and Model Optimization for Semi-supervised Speaker Spotting

    We experimentally explore feature selection and the optimization of stochastic model parameters for the problem of speaker spotting. Starting from an initially identified segment of a speaker's speech, an iterative model refinement method is developed, along with a latent variable mixture model, so that segments of the same speaker are identified in a long speech record. We find that a GMM with a moderate number of mixture components is better suited to the task than the large-mixture models used in speaker identification. Similarly, a PCA-based low-dimensional projection of the MFCC feature vectors provides better performance. We show that about 6 seconds of initially identified speaker data is sufficient to achieve > 90% speaker segment identification performance.
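
    A rough sketch of the refinement loop described above, using scikit-learn's PCA and GaussianMixture; the acceptance rule (median-score threshold), mixture size, and iteration count are placeholders, not the paper's procedure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def spot_speaker(seed_feats, segments, n_components=8, pca_dim=12, n_iters=3):
    """Iteratively refine a speaker GMM from an initial ~6 s seed segment.

    seed_feats : (N, D) MFCC frames from the initially identified speaker
    segments   : list of (M_i, D) MFCC arrays from the long recording
    Returns indices of segments attributed to the target speaker.
    """
    pca = PCA(n_components=pca_dim).fit(np.vstack([seed_feats] + segments))
    train = pca.transform(seed_feats)
    for _ in range(n_iters):
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag").fit(train)
        scores = np.array([gmm.score(pca.transform(s)) for s in segments])
        hits = np.where(scores > np.median(scores))[0]  # crude acceptance rule
        # refinement: fold the accepted segments back into the training set
        train = np.vstack([pca.transform(seed_feats)] +
                          [pca.transform(segments[i]) for i in hits])
    return hits
```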